Search | VHL Regional Portal

1.

Exact and efficient phylodynamic simulation from arbitrarily large populations.

Celentano, Michael; DeWitt, William S; Prillo, Sebastian; Song, Yun S.

ArXiv ; 2024 Feb 27.

Article in English | MEDLINE | ID: mdl-38463501

ABSTRACT

Many biological studies involve inferring the genealogical history of a sample of individuals from a large population and interpreting the reconstructed tree. Such an ascertained tree typically represents only a small part of a comprehensive population tree and is distorted by survivorship and sampling biases. Inferring evolutionary parameters from ascertained trees requires modeling both the underlying population dynamics and the ascertainment process. A crucial component of this phylodynamic modeling involves tree simulation, which is used to benchmark probabilistic inference methods. To simulate an ascertained tree, one must first simulate the full population tree and then prune unobserved lineages. Consequently, the computational cost is determined not by the size of the final simulated tree, but by the size of the population tree in which it is embedded. In most biological scenarios, simulations of the entire population are prohibitively expensive due to computational demands placed on lineages without sampled descendants. Here, we address this challenge by proving that, for any partially ascertained process from a general multi-type birth-death-mutation-sampling (BDMS) model, there exists an equivalent pure birth process (i.e., no death) with mutation and complete sampling. The final trees generated under these processes have exactly the same distribution. Leveraging this property, we propose a highly efficient algorithm for simulating trees under a general BDMS model. Our algorithm scales linearly with the size of the final simulated tree and is independent of the population size, enabling simulations from extremely large populations beyond the reach of current methods but essential for various biological applications. We anticipate that this unprecedented speedup will significantly advance the development of novel inference methods that require extensive training data.
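
The cost argument in this abstract (simulate the whole population, then prune) can be illustrated with a minimal sketch. This is the naive baseline the abstract describes, not the authors' algorithm; it assumes a single-type birth-death process, and the function names, rates, and sampling probability are hypothetical choices for illustration only.

```python
import random


def simulate_population(birth_rate, death_rate, t_max, seed=0):
    """Gillespie simulation of a single-type birth-death process.

    Returns (number of events simulated, lineages alive at t_max).
    Illustrative baseline only: every lineage is simulated, sampled or not.
    """
    if birth_rate + death_rate <= 0:
        raise ValueError("birth_rate + death_rate must be positive")
    rng = random.Random(seed)
    t, alive, events = 0.0, 1, 0
    while alive > 0:
        # Waiting time to the next event across all living lineages.
        t += rng.expovariate(alive * (birth_rate + death_rate))
        if t > t_max:
            break
        if rng.random() < birth_rate / (birth_rate + death_rate):
            alive += 1  # birth
        else:
            alive -= 1  # death
        events += 1
    return events, alive


def count_sampled(alive, sampling_prob, seed=0):
    """Number of surviving lineages kept when each is sampled independently."""
    rng = random.Random(seed)
    return sum(rng.random() < sampling_prob for _ in range(alive))


if __name__ == "__main__":
    events, alive = simulate_population(birth_rate=1.0, death_rate=0.5, t_max=12.0)
    kept = count_sampled(alive, sampling_prob=0.01)
    print(f"events simulated: {events}, alive: {alive}, sampled leaves: {kept}")
```

In this sketch the work done grows with the number of events in the full population, while the pruned tree only has as many leaves as `count_sampled` returns, which is the mismatch the abstract sets out to remove.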

2.

Highly parameterized polygenic scores tend to overfit to population stratification via random effects.

Aw, Alan J; McRae, Jeremy; Rahmani, Elior; Song, Yun S.

bioRxiv ; 2024 Jan 29.

Article in English | MEDLINE | ID: mdl-38352303

ABSTRACT

Polygenic scores (PGSs), increasingly used in clinical settings, frequently include many genetic variants, with performance typically peaking at thousands of variants. Such highly parameterized PGSs often include variants that do not pass a genome-wide significance threshold. We propose a mathematical perspective that renders the effects of many of these non-significant variants random rather than causal, with the randomness capturing population structure. We devise methods to assess variant effect randomness and population stratification bias. Applying these methods to 141 traits from the UK Biobank, we find that, for many PGSs, the effects of non-significant variants are considerably random, with the extent of randomness associated with the degree of overfitting to population structure of the discovery cohort. Our findings explain why highly parameterized PGSs simultaneously have superior cohort-specific performance and limited generalizability, suggesting the critical need for variant randomness tests in PGS evaluation. Supporting code and a dashboard are available at https://github.com/songlab-cal/StratPGS.

3.

GPN-MSA: an alignment-based DNA language model for genome-wide variant effect prediction.

Benegas, Gonzalo; Albors, Carlos; Aw, Alan J; Ye, Chengzhong; Song, Yun S.

bioRxiv ; 2024 Apr 06.

Article in English | MEDLINE | ID: mdl-37873118

ABSTRACT

Whereas protein language models have demonstrated remarkable efficacy in predicting the effects of missense variants, DNA counterparts have not yet achieved a similar competitive edge for genome-wide variant effect predictions, especially in complex genomes such as that of humans. To address this challenge, we here introduce GPN-MSA, a novel framework for DNA language models that leverages whole-genome sequence alignments across multiple species and takes only a few hours to train. Across several benchmarks on clinical databases (ClinVar, COSMIC, OMIM), experimental functional assays (DMS, DepMap), and population genomic data (gnomAD), our model for the human genome achieves outstanding performance on deleteriousness prediction for both coding and non-coding variants.

4.

ConvexML: Scalable and accurate inference of single-cell chronograms from CRISPR/Cas9 lineage tracing data.

Prillo, Sebastian; Ravoor, Akshay; Yosef, Nir; Song, Yun S.

bioRxiv ; 2023 Dec 03.

Article in English | MEDLINE | ID: mdl-38076815

ABSTRACT

CRISPR/Cas9 gene editing technology has enabled lineage tracing for thousands of cells in vivo. However, most of the analysis of CRISPR/Cas9 lineage tracing data has so far been limited to the reconstruction of single-cell tree topologies, which depict lineage relationships between cells, but not the amount of time that has passed between ancestral cell states and the present. Time-resolved trees, known as chronograms, would allow one to study the evolutionary dynamics of cell populations at an unprecedented level of resolution. Indeed, time-resolved trees would reveal the timing of events on the tree, the relative fitness of subclones, and the dynamics underlying phenotypic changes in the cell population - among other important applications. In this work, we introduce the first scalable and accurate method to refine any given single-cell tree topology into a single-cell chronogram by estimating its branch lengths. To do this, we leverage a statistical model of CRISPR/Cas9 cutting with missing data, paired with a conservative version of maximum parsimony that reconstructs only the ancestral states that we are confident about. As part of our method, we propose a novel approach to represent and handle missing data - specifically, double-resection events - which greatly simplifies and speeds up branch length estimation without compromising quality. All this leads to a convex maximum likelihood estimation (MLE) problem that can be readily solved in seconds with off-the-shelf convex optimization solvers. To stabilize estimates in low-information regimes, we propose a simple penalized version of MLE using a minimum branch length and pseudocounts. We benchmark our method using simulations and show that it performs well on several tasks, outperforming more naive baselines. Our method, which we name 'ConvexML', is available through the cassiopeia open source Python package.

5.

Tracing cancer evolution and heterogeneity using Hi-C.

Erdmann-Pham, Dan Daniel; Batra, Sanjit Singh; Turkalo, Timothy K; Durbin, James; Blanchette, Marco; Yeh, Iwei; Shain, Hunter; Bastian, Boris C; Song, Yun S; Rokhsar, Daniel S; Hockemeyer, Dirk.

Nat Commun ; 14(1): 7111, 2023 11 06.

Article in English | MEDLINE | ID: mdl-37932252

ABSTRACT

Chromosomal rearrangements can initiate and drive cancer progression, yet it has been challenging to evaluate their impact, especially in genetically heterogeneous solid cancers. To address this problem we developed HiDENSEC, a new computational framework for analyzing chromatin conformation capture in heterogeneous samples that can infer somatic copy number alterations, characterize large-scale chromosomal rearrangements, and estimate cancer cell fractions. After validating HiDENSEC with in silico and in vitro controls, we used it to characterize chromosome-scale evolution during melanoma progression in formalin-fixed tumor samples from three patients. The resulting comprehensive annotation of the genomic events includes copy number neutral translocations that disrupt tumor suppressor genes such as NF1, whole chromosome arm exchanges that result in loss of CDKN2A, and whole-arm copy-number neutral loss of homozygosity involving PTEN. These findings show that large-scale chromosomal rearrangements occur throughout cancer evolution and that characterizing these events yields insights into drivers of melanoma progression.

Subject(s)

Chromosome Aberrations , Melanoma , Humans , DNA Copy Number Variations , Chromosomes , Translocation, Genetic , Melanoma/genetics

6.

Predicting the effect of CRISPR-Cas9-based epigenome editing.

Batra, Sanjit Singh; Cabrera, Alan; Spence, Jeffrey P; Hilton, Isaac B; Song, Yun S.

bioRxiv ; 2023 Oct 03.

Article in English | MEDLINE | ID: mdl-37873127

ABSTRACT

Epigenetic regulation orchestrates mammalian transcription, but functional links between them remain elusive. To tackle this problem, we here use epigenomic and transcriptomic data from 13 ENCODE cell types to train machine learning models to predict gene expression from histone post-translational modifications (PTMs), achieving transcriptome-wide correlations of ~0.70-0.79 for most samples. In addition to recapitulating known associations between histone PTMs and expression patterns, our models predict that acetylation of histone subunit H3 lysine residue 27 (H3K27ac) near the transcription start site (TSS) significantly increases expression levels. To validate this prediction experimentally and investigate how engineered vs. natural deposition of H3K27ac might differentially affect expression, we apply the synthetic dCas9-p300 histone acetyltransferase system to 8 genes in the HEK293T cell line. Further, to facilitate model building, we perform MNase-seq to map genome-wide nucleosome occupancy levels in HEK293T. We observe that our models perform well in accurately ranking relative fold changes among genes in response to the dCas9-p300 system; however, their ability to rank fold changes within individual genes is noticeably diminished compared to predicting expression across cell types from their native epigenetic signatures. Our findings highlight the need for more comprehensive genome-scale epigenome editing datasets, better understanding of the actual modifications made by epigenome editing tools, and improved causal models that transfer better from endogenous cellular measurements to perturbation experiments. Together these improvements would facilitate the ability to understand and predictably control the dynamic human epigenome with consequences for human health.

7.

DNA language models are powerful predictors of genome-wide variant effects.

Benegas, Gonzalo; Batra, Sanjit Singh; Song, Yun S.

Proc Natl Acad Sci U S A ; 120(44): e2311219120, 2023 Oct 31.

Article in English | MEDLINE | ID: mdl-37883436

ABSTRACT

The expanding catalog of genome-wide association studies (GWAS) provides biological insights across a variety of species, but identifying the causal variants behind these associations remains a significant challenge. Experimental validation is both labor-intensive and costly, highlighting the need for accurate, scalable computational methods to predict the effects of genetic variants across the entire genome. Inspired by recent progress in natural language processing, unsupervised pretraining on large protein sequence databases has proven successful in extracting complex information related to proteins. These models showcase their ability to learn variant effects in coding regions using an unsupervised approach. Expanding on this idea, we here introduce the Genomic Pre-trained Network (GPN), a model designed to learn genome-wide variant effects through unsupervised pretraining on genomic DNA sequences. Our model also successfully learns gene structure and DNA motifs without any supervision. To demonstrate its utility, we train GPN on unaligned reference genomes of Arabidopsis thaliana and seven related species within the Brassicales order and evaluate its ability to predict the functional impact of genetic variants in A. thaliana by utilizing allele frequencies from the 1001 Genomes Project and a comprehensive database of GWAS. Notably, GPN outperforms predictors based on popular conservation scores such as phyloP and phastCons. Our predictions for A. thaliana can be visualized as sequence logos in the UCSC Genome Browser (https://genome.ucsc.edu/s/gbenegas/gpn-arabidopsis). We provide code (https://github.com/songlab-cal/gpn) to train GPN for any given species using its DNA sequence alone, enabling unsupervised prediction of variant effects across the entire genome.

Subject(s)

Arabidopsis , Arabidopsis/genetics , Genome-Wide Association Study , Genomics , Genome , DNA

8.

Cross-protein transfer learning substantially improves disease variant prediction.

Jagota, Milind; Ye, Chengzhong; Albors, Carlos; Rastogi, Ruchir; Koehl, Antoine; Ioannidis, Nilah; Song, Yun S.

Genome Biol ; 24(1): 182, 2023 08 07.

Article in English | MEDLINE | ID: mdl-37550700

ABSTRACT

BACKGROUND: Genetic variation in the human genome is a major determinant of individual disease risk, but the vast majority of missense variants have unknown etiological effects. Here, we present a robust learning framework for leveraging saturation mutagenesis experiments to construct accurate computational predictors of proteome-wide missense variant pathogenicity. RESULTS: We train cross-protein transfer (CPT) models using deep mutational scanning (DMS) data from only five proteins and achieve state-of-the-art performance on clinical variant interpretation for unseen proteins across the human proteome. We also improve predictive accuracy on DMS data from held-out proteins. High sensitivity is crucial for clinical applications and our model CPT-1 particularly excels in this regime. For instance, at 95% sensitivity of detecting human disease variants annotated in ClinVar, CPT-1 improves specificity to 68%, from 27% for ESM-1v and 55% for EVE. Furthermore, for genes not used to train REVEL, a supervised method widely used by clinicians, we show that CPT-1 compares favorably with REVEL. Our framework combines predictive features derived from general protein sequence models, vertebrate sequence alignments, and AlphaFold structures, and it is adaptable to the future inclusion of other sources of information. We find that vertebrate alignments, albeit rather shallow with only 100 genomes, provide a strong signal for variant pathogenicity prediction that is complementary to recent deep learning-based models trained on massive amounts of protein sequence data. We release predictions for all possible missense variants in 90% of human genes. CONCLUSIONS: Our results demonstrate the utility of mutational scanning data for learning properties of variants that transfer to unseen proteins.
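
The "specificity at 95% sensitivity" figure quoted in this abstract is a threshold-based metric, and a minimal sketch of how such a number can be computed may help. The function name, the convention that higher scores mean "more likely pathogenic", and the toy data are all assumptions made here for illustration; the paper's exact evaluation procedure is not described in the abstract.

```python
import math


def specificity_at_sensitivity(scores, labels, target_sensitivity=0.95):
    """Specificity at the strictest threshold reaching the target sensitivity.

    scores: predictor outputs, higher = more likely pathogenic (assumed).
    labels: 1 for pathogenic variants, 0 for benign variants.
    """
    pos = sorted((s for s, y in zip(scores, labels) if y == 1), reverse=True)
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one pathogenic and one benign variant")
    # Number of pathogenic variants that must be called to hit the target.
    needed = math.ceil(target_sensitivity * len(pos))
    threshold = pos[needed - 1]  # call "pathogenic" when score >= threshold
    true_negatives = sum(s < threshold for s in neg)
    return true_negatives / len(neg)


if __name__ == "__main__":
    toy_scores = [0.9, 0.8, 0.3, 0.2, 0.5, 0.25, 0.1, 0.05]
    toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]
    print(specificity_at_sensitivity(toy_scores, toy_labels, 0.75))
```

Fixing sensitivity first and then reporting specificity reflects the clinical priority stated in the abstract: few pathogenic variants should be missed, so predictors are compared by how many benign variants they still clear at that operating point.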

Subject(s)

Machine Learning , Proteome , Humans , Proteome/genetics , Amino Acid Sequence , Mutation , Mutation, Missense , Computational Biology/methods

9.

CELL-E: A Text-To-Image Transformer for Protein Localization Prediction.

Khwaja, Emaad; Song, Yun S; Huang, Bo.

Res Sq ; 2023 Jun 02.

Article in English | MEDLINE | ID: mdl-37398207

ABSTRACT

Accurately predicting cellular activities of proteins based on their primary amino acid sequences would greatly improve our understanding of the proteome. In this paper, we present CELL-E, a text-to-image transformer model that generates 2D probability density images describing the spatial distribution of proteins within cells. Given an amino acid sequence and a reference image for cell or nucleus morphology, CELL-E predicts a more refined representation of protein localization, as opposed to previous in silico methods that rely on pre-defined, discrete class annotations of protein localization to subcellular compartments.

10.

CherryML: scalable maximum likelihood estimation of phylogenetic models.

Prillo, Sebastian; Deng, Yun; Boyeau, Pierre; Li, Xingyu; Chen, Po-Yen; Song, Yun S.

Nat Methods ; 20(8): 1232-1236, 2023 08.

Article in English | MEDLINE | ID: mdl-37386188

ABSTRACT

Phylogenetic models of molecular evolution are central to numerous biological applications spanning diverse timescales, from hundreds of millions of years involving orthologous proteins to just tens of days relating to single cells within an organism. A fundamental problem in these applications is estimating model parameters, for which maximum likelihood estimation is typically employed. Unfortunately, maximum likelihood estimation is a computationally expensive task, in some cases prohibitively so. To address this challenge, we here introduce CherryML, a broadly applicable method that achieves several orders of magnitude speedup by using a quantized composite likelihood over cherries in the trees. The massive speedup offered by our method should enable researchers to consider more complex and biologically realistic models than previously possible. Here we demonstrate CherryML's utility by applying it to estimate a general 400 × 400 rate matrix for residue-residue coevolution at contact sites in three-dimensional protein structures; we estimate that using current state-of-the-art methods such as the expectation-maximization algorithm for the same task would take >100,000 times longer.
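
For readers unfamiliar with the term, a cherry in the usual phylogenetic sense is a pair of leaves that share a parent node. The sketch below only extracts such pairs from a tree written as nested tuples; the representation and the function name are assumptions for illustration, and the way the method itself selects and uses cherries may differ from this simple definition.

```python
def find_cherries(tree):
    """Return all cherries (pairs of sibling leaves) in a nested-tuple tree.

    Leaves are strings; internal nodes are tuples of children.
    """
    cherries = []

    def visit(node):
        if isinstance(node, str):
            return  # a leaf has no children to inspect
        if len(node) == 2 and all(isinstance(child, str) for child in node):
            cherries.append((node[0], node[1]))
        for child in node:
            visit(child)

    visit(tree)
    return cherries


if __name__ == "__main__":
    example = (("A", "B"), ("C", ("D", "E")))
    print(find_cherries(example))
```

A likelihood built from pairs like these involves only two sequences at a time, which is one intuition for why a composite likelihood over cherries can be much cheaper than a likelihood over the full tree.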

Subject(s)

Evolution, Molecular , Proteins , Phylogeny , Likelihood Functions , Algorithms , Models, Genetic

11.

A fast machine-learning-guided primer design pipeline for selective whole genome amplification.

Dwivedi-Yu, Jane A; Oppler, Zachary J; Mitchell, Matthew W; Song, Yun S; Brisson, Dustin.

PLoS Comput Biol ; 19(4): e1010137, 2023 04.

Article in English | MEDLINE | ID: mdl-37068103

ABSTRACT

Addressing many of the major outstanding questions in the fields of microbial evolution and pathogenesis will require analyses of populations of microbial genomes. Although population genomic studies provide the analytical resolution to investigate evolutionary and mechanistic processes at fine spatial and temporal scales - precisely the scales at which these processes occur - microbial population genomic research is currently hindered by the practicalities of obtaining sufficient quantities of the relatively pure microbial genomic DNA necessary for next-generation sequencing. Here we present swga2.0, an optimized and parallelized pipeline to design selective whole genome amplification (SWGA) primer sets. Unlike previous methods, swga2.0 incorporates active and machine learning methods to evaluate the amplification efficacy of individual primers and primer sets. Additionally, swga2.0 optimizes primer set search and evaluation strategies, including parallelization at each stage of the pipeline, to dramatically decrease program runtime. Here we describe the swga2.0 pipeline, including the empirical data used to identify primer and primer set characteristics that improve amplification performance. Additionally, we evaluate the novel swga2.0 pipeline by designing primer sets that successfully amplify Prevotella melaninogenica, an important component of the lung microbiome in cystic fibrosis patients, from samples dominated by human DNA.

Subject(s)

Genome , Genomics , Humans , Sequence Analysis, DNA/methods , DNA

12.

The Impact of Stability Considerations on Genetic Fine-Mapping.

Aw, Alan; Jin, Lionel Chentian; Ioannidis, Nilah; Song, Yun S.

bioRxiv ; 2023 Apr 13.

Article in English | MEDLINE | ID: mdl-37090514

ABSTRACT

Fine-mapping methods, which aim to identify genetic variants responsible for complex traits following genetic association studies, typically assume that sufficient adjustments for confounding within the association study cohort have been made, e.g., through regressing out the top principal components (i.e., residualization). Despite its widespread use, however, residualization may not completely remove all sources of confounding. Here, we propose a complementary stability-guided approach that does not rely on residualization, which identifies consistently fine-mapped variants across different genetic backgrounds or environments. We demonstrate the utility of this approach by applying it to fine-map eQTLs in the GEUVADIS data. Using 378 different functional annotations of the human genome, including recent deep learning-based annotations (e.g., Enformer), we compare enrichments of these annotations among variants for which the stability and traditional residualization-based fine-mapping approaches agree against those for which they disagree, and find that the stability approach enhances the power of traditional fine-mapping methods in identifying variants with functional impact. Finally, in cases where the two approaches report distinct variants, our approach identifies variants comparably enriched for functional annotations. Our findings suggest that the stability principle, as a conceptually simple device, complements existing approaches to fine-mapping, reinforcing recent advocacy of evaluating cross-population and cross-environment portability of biological findings. To support visualization and interpretation of our results, we provide a Shiny app, available at: https://alan-aw.shinyapps.io/stability_v0/.

13.

The ENCODE Imputation Challenge: a critical assessment of methods for cross-cell type imputation of epigenomic profiles.

Schreiber, Jacob; Boix, Carles; Wook Lee, Jin; Li, Hongyang; Guan, Yuanfang; Chang, Chun-Chieh; Chang, Jen-Chien; Hawkins-Hooker, Alex; Schölkopf, Bernhard; Schweikert, Gabriele; Carulla, Mateo Rojas; Canakoglu, Arif; Guzzo, Francesco; Nanni, Luca; Masseroli, Marco; Carman, Mark James; Pinoli, Pietro; Hong, Chenyang; Yip, Kevin Y; Spence, Jeffrey P; Batra, Sanjit Singh; Song, Yun S; Mahony, Shaun; Zhang, Zheng; Tan, Wuwei; Shen, Yang; Sun, Yuanfei; Shi, Minyi; Adrian, Jessika; Sandstrom, Richard; Farrell, Nina; Halow, Jessica; Lee, Kristen; Jiang, Lixia; Yang, Xinqiong; Epstein, Charles; Strattan, J Seth; Bernstein, Bradley; Snyder, Michael; Kellis, Manolis; Stafford, William; Kundaje, Anshul.

Genome Biol ; 24(1): 79, 2023 04 18.

Article in English | MEDLINE | ID: mdl-37072822

ABSTRACT

A promising alternative to comprehensively performing genomics experiments is to, instead, perform a subset of experiments and use computational methods to impute the remainder. However, identifying the best imputation methods and what measures meaningfully evaluate performance are open questions. We address these questions by comprehensively analyzing 23 methods from the ENCODE Imputation Challenge. We find that imputation evaluations are challenging and confounded by distributional shifts from differences in data collection and processing over time, the amount of available data, and redundancy among performance measures. Our analyses suggest simple steps for overcoming these issues and promising directions for more robust research.

Subject(s)

Algorithms , Epigenomics , Genomics/methods

14.

Whole-genome sequencing reveals a complex African population demographic history and signatures of local adaptation.

Fan, Shaohua; Spence, Jeffrey P; Feng, Yuanqing; Hansen, Matthew E B; Terhorst, Jonathan; Beltrame, Marcia H; Ranciaro, Alessia; Hirbo, Jibril; Beggs, William; Thomas, Neil; Nyambo, Thomas; Mpoloka, Sununguko Wata; Mokone, Gaonyadiwe George; Njamnshi, Alfred; Folkunang, Charles; Meskel, Dawit Wolde; Belay, Gurja; Song, Yun S; Tishkoff, Sarah A.

Cell ; 186(5): 923-939.e14, 2023 03 02.

Article in English | MEDLINE | ID: mdl-36868214

ABSTRACT

We conduct high coverage (>30×) whole-genome sequencing of 180 individuals from 12 indigenous African populations. We identify millions of unreported variants, many predicted to be functionally important. We observe that the ancestors of southern African San and central African rainforest hunter-gatherers (RHG) diverged from other populations >200 kya and maintained a large effective population size. We observe evidence for ancient population structure in Africa and for multiple introgression events from "ghost" populations with highly diverged genetic lineages. Although currently geographically isolated, we observe evidence for gene flow between eastern and southern Khoesan-speaking hunter-gatherer populations lasting until ~12 kya. We identify signatures of local adaptation for traits related to skin color, immune response, height, and metabolic processes. We identify a positively selected variant in the lightly pigmented San that influences pigmentation in vitro by regulating the enhancer activity and gene expression of PDPK1.

Subject(s)

Acclimatization , Skin Pigmentation , Humans , Whole Genome Sequencing , Population Density , Africa , 3-Phosphoinositide-Dependent Protein Kinases

15.

Selective whole-genome amplification reveals population genetics of Leishmania braziliensis directly from patient skin biopsies.

Pilling, Olivia A; Reis-Cunha, João L; Grace, Cooper A; Berry, Alexander S F; Mitchell, Matthew W; Yu, Jane A; Malekshahi, Clara R; Krespan, Elise; Go, Christina K; Lombana, Cláudia; Song, Yun S; Amorim, Camila F; Lago, Alexsandro S; Carvalho, Lucas P; Carvalho, Edgar M; Brisson, Dustin; Scott, Phillip; Jeffares, Daniel C; Beiting, Daniel P.

PLoS Pathog ; 19(3): e1011230, 2023 03.

Article in English | MEDLINE | ID: mdl-36940219

ABSTRACT

In Brazil, Leishmania braziliensis is the main causative agent of the neglected tropical disease, cutaneous leishmaniasis (CL). CL presents on a spectrum of disease severity with a high rate of treatment failure. Yet the parasite factors that contribute to disease presentation and treatment outcome are not well understood, in part because successfully isolating and culturing parasites from patient lesions remains a major technical challenge. Here we describe the development of selective whole genome amplification (SWGA) for Leishmania and show that this method enables culture-independent analysis of parasite genomes obtained directly from primary patient skin samples, allowing us to circumvent artifacts associated with adaptation to culture. We show that SWGA can be applied to multiple Leishmania species residing in different host species, suggesting that this method is broadly useful in both experimental infection models and clinical studies. SWGA carried out directly on skin biopsies collected from patients in Corte de Pedra, Bahia, Brazil, showed extensive genomic diversity. Finally, as a proof-of-concept, we demonstrated that SWGA data can be integrated with published whole genome data from cultured parasite isolates to identify variants unique to specific geographic regions in Brazil where treatment failure rates are known to be high. SWGA provides a relatively simple method to generate Leishmania genomes directly from patient samples, unlocking the potential to link parasite genetics with host clinical phenotypes.

Subject(s)

Genome, Protozoan , Leishmaniasis, Cutaneous , Parasitology , Skin , Genome, Protozoan/genetics , Humans , Genetics, Population , Skin/parasitology , Brazil , Leishmaniasis, Cutaneous/parasitology , Parasitology/methods , Leishmania braziliensis/genetics

16.

Enzyme Activity Prediction of Sequence Variants on Novel Substrates using Improved Substrate Encodings and Convolutional Pooling.

Xu, Zhiqing; Wu, Jinghao; Song, Yun S; Mahadevan, Radhakrishnan.

Proc Mach Learn Res ; 165: 78-87, 2022 Nov.

Article in English | MEDLINE | ID: mdl-36530936

ABSTRACT

Protein engineering is currently being revolutionized by deep learning applications, especially through natural language processing (NLP) techniques. It has been shown that state-of-the-art self-supervised language models trained on entire protein databases capture hidden contextual and structural information in amino acid sequences and are capable of improving sequence-to-function predictions. Yet, recent studies have reported that current compound-protein modeling approaches perform poorly on learning interactions between enzymes and substrates of interest within one protein family. We attribute this to low-grade substrate encoding methods and over-compressed sequence representations received by downstream predictive models. In this study, we propose a new substrate-encoding based on Extended Connectivity Fingerprints (ECFPs) and a convolutional-pooling of the sequence embeddings. Through testing on an activity profiling dataset of haloalkanoate dehalogenase superfamily that measures activities of 218 phosphatases against 168 substrates, we show substantial improvements in predictive performances of compound-protein interaction modeling. In addition, we also test the workflow on three other datasets from the halogenase, kinase and aminotransferase families and show that our pipeline achieves good performance on these datasets as well. We further demonstrate the utility of this downstream model architecture by showing that it achieves good performance with six different protein embeddings, including ESM-1b (Rives et al., 2021), TAPE (Rao et al., 2019), ProtBert, ProtAlbert, ProtT5, and ProtXLNet (Elnaggar et al., 2021). This study provides a new workflow for activity prediction on novel substrates that can be used to engineer new enzymes for sustainability applications.

17.

Functional genomics of OCTN2 variants informs protein-specific variant effect predictor for Carnitine Transporter Deficiency.

Koleske, Megan L; McInnes, Gregory; Brown, Julia E H; Thomas, Neil; Hutchinson, Keino; Chin, Marcus Y; Koehl, Antoine; Arkin, Michelle R; Schlessinger, Avner; Gallagher, Renata C; Song, Yun S; Altman, Russ B; Giacomini, Kathleen M.

Proc Natl Acad Sci U S A ; 119(46): e2210247119, 2022 Nov 16.

Article in English | MEDLINE | ID: mdl-36343260

ABSTRACT

Genetic variants in SLC22A5, encoding the membrane carnitine transporter OCTN2, cause the rare metabolic disorder Carnitine Transporter Deficiency (CTD). CTD is potentially lethal but actionable if detected early, with confirmatory diagnosis involving sequencing of SLC22A5. Interpretation of missense variants of uncertain significance (VUSs) is a major challenge. In this study, we sought to characterize the largest set to date (n = 150) of OCTN2 variants identified in diverse ancestral populations, with the goals of furthering our understanding of the mechanisms leading to OCTN2 loss-of-function (LOF) and creating a protein-specific variant effect prediction model for OCTN2 function. Uptake assays with 14C-carnitine revealed that 105 variants (70%) significantly reduced transport of carnitine compared to wild-type OCTN2, and 37 variants (25%) severely reduced function to less than 20%. All ancestral populations harbored LOF variants; 62% of green fluorescent protein (GFP)-tagged variants impaired OCTN2 localization to the plasma membrane of human embryonic kidney (HEK293T) cells, and subcellular localization significantly associated with function, revealing a major LOF mechanism of interest for CTD. With these data, we trained a model to classify variants as functional (>20% function) or LOF (<20% function). Our model outperformed existing state-of-the-art methods as evaluated by multiple performance metrics, with mean area under the receiver operating characteristic curve (AUROC) of 0.895 ± 0.025. In summary, in this study we generated a rich dataset of OCTN2 variant function and localization, revealed important disease-causing mechanisms, and improved upon machine learning-based prediction of OCTN2 variant function to aid in variant interpretation in the diagnosis and treatment of CTD.
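
The two evaluation ingredients named in this abstract, the 20% functional cutoff and the AUROC, can be sketched in a few lines. This is an illustration only, not the authors' model or pipeline: the function names and toy values are hypothetical, and because the abstract does not say how a variant at exactly 20% is labeled, the sketch assumes it counts as LOF.

```python
def classify_function(percent_of_wild_type, cutoff=20.0):
    """Label a variant by measured transport relative to wild type.

    Assumption: a value exactly at the cutoff is labeled "LOF".
    """
    return "functional" if percent_of_wild_type > cutoff else "LOF"


def auroc(scores, is_positive):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a random positive outscores a random
    negative, counting ties as one half.
    """
    pos = [s for s, y in zip(scores, is_positive) if y]
    neg = [s for s, y in zip(scores, is_positive) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    measured = [85.0, 40.0, 12.0, 3.0]      # hypothetical % of wild-type uptake
    predicted = [0.9, 0.2, 0.3, 0.1]        # hypothetical model scores
    truth = [classify_function(m) == "functional" for m in measured]
    print(auroc(predicted, truth))
```

The pairwise form is quadratic in the number of variants, which is acceptable for a set of about 150 variants; rank-based formulas give the same value faster on larger sets.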

Subject(s)

Carnitine , Organic Cation Transport Proteins , Humans , Solute Carrier Family 22 Member 5/genetics , Solute Carrier Family 22 Member 5/metabolism , Organic Cation Transport Proteins/genetics , Organic Cation Transport Proteins/metabolism , HEK293 Cells , Carnitine/genetics , Carnitine/metabolism , Genomics

18.

Robust and annotation-free analysis of alternative splicing across diverse cell types in mice.

Benegas, Gonzalo; Fischer, Jonathan; Song, Yun S.

Elife ; 11, 2022 03 01.

Article in English | MEDLINE | ID: mdl-35229721

ABSTRACT

Although alternative splicing is a fundamental and pervasive aspect of gene expression in higher eukaryotes, it is often omitted from single-cell studies due to quantification challenges inherent to commonly used short-read sequencing technologies. Here, we undertake the analysis of alternative splicing across numerous diverse murine cell types from two large-scale single-cell datasets - the Tabula Muris and BRAIN Initiative Cell Census Network - while accounting for understudied technical artifacts and unannotated events. We find strong and general cell-type-specific alternative splicing, complementary to total gene expression but of similar discriminatory value, and identify a large volume of novel splicing events. We specifically highlight splicing variation across different cell types in primary motor cortex neurons, bone marrow B cells, and various epithelial cells, and we show that the implicated transcripts include many genes which do not display total expression differences. To elucidate the regulation of alternative splicing, we build a custom predictive model based on splicing factor activity, recovering several known interactions while generating new hypotheses, including potential regulatory roles for novel alternative splicing events in critical genes like Khdrbs3 and Rbfox1. We make our results available using public interactive browsers to spur further exploration by the community.

Cells are the basic building blocks of all living things. There are numerous types of cells, and each cell has its own machinery to fulfill a specialised role. Despite their different purposes, most cells contain the same instructions, stored as DNA, on how to assemble the proteins needed to perform their intended functions. Cell types often vary in the frequency that each gene is read, leading to different quantities of proteins produced. Moreover, a process known as alternative splicing enables cells to build multiple proteins from the same gene. It works by joining fragments of a gene's code in various combinations. The resulting RNA sequences are molecular templates that cells use to assemble proteins. Analysing these RNA sequences reveals which genes are switched on in different tissues of the body, and what proteins are being made. However, despite recent advancements, alternative splicing is rarely studied in single cells because of some sizeable technical challenges. Benegas, Fischer and Song developed a computational toolkit designed to handle the unique challenges of analysing alternative splicing events in single cells. The analysis pipeline, called scQuint, was tested on two large datasets that capture cell-to-cell differences in the brain and other tissues of mice. Nearly all the cell types studied exhibited clear differences in alternative splicing, such that cell types could be distinguished based on their splicing profiles. Intriguing patterns of splicing were highlighted in some immune cells and certain types of neurons. Across cell types, the genes with unique splicing patterns were often not the same as those with unique activity patterns, indicating that gene expression and alternative splicing are two complementary processes. New types of alternative splicing events were also identified. Benegas et al. also developed a statistical model to probe the roles of splicing regulators in different cell types. 
In summary, the scQuint toolkit overcomes critical technical challenges typically encountered when analysing alternative splicing in single cells. It also reveals new insights about mechanisms of alternative splicing. The results are open access, made available using public interactive browsers, which should spur on other researchers to interrogate how alternative splicing differs in single cells.

Subject(s)

Alternative Splicing , RNA Splicing , Animals , Computational Biology/methods , Mice , RNA Splicing Factors/genetics , RNA-Binding Proteins/genetics , Software

19.

Transferability of Geometric Patterns from Protein Self-Interactions to Protein-Ligand Interactions.

Koehl, Antoine; Jagota, Milind; Erdmann-Pham, Dan D; Fung, Alexander; Song, Yun S.

Pac Symp Biocomput ; 27: 22-33, 2022.

Article in English | MEDLINE | ID: mdl-34890133

ABSTRACT

There is significant interest in developing machine learning methods to model protein-ligand interactions but a scarcity of experimentally resolved protein-ligand structures to learn from. Protein self-contacts are a much larger source of structural data that could be leveraged, but currently it is not well understood how this data source differs from the target domain. Here, we characterize the 3D geometric patterns of protein self-contacts as probability distributions. We then present a flexible statistical framework to assess the transferability of these patterns to protein-ligand contacts. We observe that the level of transferability from protein self-contacts to protein-ligand contacts depends on contact type, with many contact types exhibiting high transferability. We then demonstrate the potential of leveraging information from these geometric patterns to aid in ligand pose-selection problems in protein-ligand docking. We publicly release our extracted data on geometric interaction patterns to enable further exploration of this problem.

Subject(s)

Computational Biology , Proteins , Humans , Ligands , Machine Learning , Protein Binding , Proteins/metabolism

20.

Interpreting Potts and Transformer Protein Models Through the Lens of Simplified Attention.

Bhattacharya, Nicholas; Thomas, Neil; Rao, Roshan; Dauparas, Justas; Koo, Peter K; Baker, David; Song, Yun S; Ovchinnikov, Sergey.

Pac Symp Biocomput ; 27: 34-45, 2022.

Article in English | MEDLINE | ID: mdl-34890134

ABSTRACT

The established approach to unsupervised protein contact prediction estimates coevolving positions using undirected graphical models. This approach trains a Potts model on a Multiple Sequence Alignment. Increasingly large Transformers are being pretrained on unlabeled, unaligned protein sequence databases and showing competitive performance on protein contact prediction. We argue that attention is a principled model of protein interactions, grounded in real properties of protein family data. We introduce an energy-based attention layer, factored attention, which, in a certain limit, recovers a Potts model, and use it to contrast Potts and Transformers. We show that the Transformer leverages hierarchical signal in protein family databases not captured by single-layer models. This raises the exciting possibility for the development of powerful structured models of protein family databases.

Subject(s)

Computational Biology , Proteins , Attention , Humans , Proteins/genetics , Sequence Alignment
